AAAI.2023 - Speech and Natural Language Processing | Cool Papers

#1 Generalized Category Discovery with Decoupled Prototypical Network [PDF²] [Copy] [Kimi] [REL]

Authors: Wenbin An, Feng Tian, Qinghua Zheng, Wei Ding, Qianying Wang, Ping Chen

Generalized Category Discovery (GCD) aims to recognize both known and novel categories from a set of unlabeled data, based on another dataset labeled with only known categories. Without considering differences between known and novel categories, current methods learn about them in a coupled manner, which can hurt model's generalization and discriminative ability. Furthermore, the coupled training approach prevents these models transferring category-specific knowledge explicitly from labeled data to unlabeled data, which can lose high-level semantic information and impair model performance. To mitigate above limitations, we present a novel model called Decoupled Prototypical Network (DPN). By formulating a bipartite matching problem for category prototypes, DPN can not only decouple known and novel categories to achieve different training targets effectively, but also align known categories in labeled and unlabeled data to transfer category-specific knowledge explicitly and capture high-level semantics. Furthermore, DPN can learn more discriminative features for both known and novel categories through our proposed Semantic-aware Prototypical Learning (SPL). Besides capturing meaningful semantic information, SPL can also alleviate the noise of hard pseudo labels through semantic-weighted soft assignment. Extensive experiments show that DPN outperforms state-of-the-art models by a large margin on all evaluation metrics across multiple benchmark datasets. Code and data are available at https://github.com/Lackel/DPN.

Subject: AAAI.2023 - Speech and Natural Language Processing

#2 Structured Case-Based Reasoning for Inference-Time Adaptation of Text-to-SQL Parsers [PDF²] [Copy] [Kimi] [REL]

Authors: Abhijeet Awasthi, Soumen Chakrabarti, Sunita Sarawagi

Inference-time adaptation methods for semantic parsing are useful for leveraging examples from newly-observed domains without repeated fine-tuning. Existing approaches typically bias the decoder by simply concatenating input-output example pairs (cases) from the new domain at the encoder’s input in a Seq-to-Seq model. Such methods cannot adequately leverage the structure of logical forms in the case examples. We propose StructCBR, a structured case-based reasoning approach, which leverages subtree-level similarity between logical forms of cases and candidate outputs, resulting in better decoder decisions. For the task of adapting Text-to-SQL models to unseen schemas, we show that exploiting case examples in a structured manner via StructCBR offers consistent performance improvements over prior inference-time adaptation methods across five different databases. To the best of our knowledge, we are the first to attempt inference-time adaptation of Text-to-SQL models, and harness trainable structured similarity between subqueries.

Subject: AAAI.2023 - Speech and Natural Language Processing

#3 SegFormer: A Topic Segmentation Model with Controllable Range of Attention [PDF²] [Copy] [Kimi] [REL]

Authors: Haitao Bai, Pinghui Wang, Ruofei Zhang, Zhou Su

Topic segmentation aims to reveal the latent structure of a document and divide it into multiple parts. However, current neural solutions are limited in the context modeling of sentences and feature representation of candidate boundaries. This causes the model to suffer from inefficient sentence context encoding and noise information interference. In this paper, we design a new text segmentation model SegFormer with unidirectional attention blocks to better model sentence representations. To alleviate the problem of noise information interference, SegFormer uses a novel additional context aggregator and a topic classification loss to guide the model to aggregate the information within the appropriate range. In addition, SegFormer applies an iterative prediction algorithm to search for optimal boundaries progressively. We evaluate SegFormer's generalization ability, multilingual ability, and application ability on multiple challenging real-world datasets. Experiments show that our model significantly improves the performance by 7.5% on the benchmark WIKI-SECTION compared to several strong baselines. The application of SegFormer to a real-world dataset to separate normal and advertisement segments in product marketing essays also achieves superior performance in the evaluation with other cutting-edge models.

Subject: AAAI.2023 - Speech and Natural Language Processing

#4 Rich Event Modeling for Script Event Prediction [PDF¹] [Copy] [Kimi] [REL]

Authors: Long Bai, Saiping Guan, Zixuan Li, Jiafeng Guo, Xiaolong Jin, Xueqi Cheng

Script is a kind of structured knowledge extracted from texts, which contains a sequence of events. Based on such knowledge, script event prediction aims to predict the subsequent event. To do so, two aspects should be considered for events, namely, event description (i.e., what the events should contain) and event encoding (i.e., how they should be encoded). Most existing methods describe an event by a verb together with a few core arguments (i.e., subject, object, and indirect object), which are not precise enough. In addition, existing event encoders are limited to a fixed number of arguments, which are not flexible enough to deal with extra information. Thus, in this paper, we propose the Rich Event Prediction (REP) framework for script event prediction. Fundamentally, it is based on the proposed rich event description, which enriches the existing ones with three kinds of important information, namely, the senses of verbs, extra semantic roles, and types of participants. REP contains an event extractor to extract such information from texts. Based on the extracted rich information, a predictor then selects the most probable subsequent event. The core component of the predictor is a transformer-based event encoder that integrates the above information flexibly. Experimental results on the widely used Gigaword Corpus show the effectiveness of the proposed framework.

Subject: AAAI.2023 - Speech and Natural Language Processing

#5 Avocodo: Generative Adversarial Network for Artifact-Free Vocoder [PDF²] [Copy] [Kimi] [REL]

Authors: Taejun Bak, Junmo Lee, Hanbin Bae, Jinhyeok Yang, Jae-Sung Bae, Young-Sun Joo

Neural vocoders based on the generative adversarial neural network (GAN) have been widely used due to their fast inference speed and lightweight networks while generating high-quality speech waveforms. Since the perceptually important speech components are primarily concentrated in the low-frequency bands, most GAN-based vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments, we discovered that the multi-scale analysis which focuses on the low-frequency bands causes unintended artifacts, e.g., aliasing and imaging artifacts, which degrade the synthesized speech waveform quality. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based vocoders and propose a GAN-based vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate speech waveforms in various perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank to obtain downsampled multi-band speech waveforms while avoiding aliasing. According to experimental results, Avocodo outperforms baseline GAN-based vocoders, both objectively and subjectively, while reproducing speech with fewer artifacts.

Subject: AAAI.2023 - Speech and Natural Language Processing

#6 End-to-End Deep Reinforcement Learning for Conversation Disentanglement [PDF¹] [Copy] [Kimi] [REL]

Authors: Karan Bhukar, Harshit Kumar, Dinesh Raghu, Ajay Gupta

Collaborative Communication platforms (e.g., Slack) support multi-party conversations which contain a large number of messages on shared channels. Multiple conversations intermingle within these messages. The task of conversation disentanglement is to cluster these intermingled messages into conversations. Existing approaches are trained using loss functions that optimize only local decisions, i.e. predicting reply-to links for each message and thereby creating clusters of conversations. In this work, we propose an end-to-end reinforcement learning (RL) approach that directly optimizes a global metric. We observe that using existing global metrics such as variation of information and adjusted rand index as a reward for the RL agent deteriorates its performance. This behaviour is because these metrics completely ignore the reply-to links between messages (local decisions) during reward computation. Therefore, we propose a novel thread-level reward function that captures the global metric without ignoring the local decisions. Through experiments on the Ubuntu IRC dataset, we demonstrate that the proposed RL model improves the performance on both link-level and conversation-level metrics.

Subject: AAAI.2023 - Speech and Natural Language Processing

#7 Self-Supervised Logic Induction for Explainable Fuzzy Temporal Commonsense Reasoning [PDF¹] [Copy] [Kimi] [REL]

Authors: Bibo Cai, Xiao Ding, Zhouhao Sun, Bing Qin, Ting Liu, Baojun wang, Lifeng Shang

Understanding temporal commonsense concepts, such as times of occurrence and durations is crucial for event-centric language understanding. Reasoning about such temporal concepts in a complex context requires reasoning over both the stated context and the world knowledge that underlines it. A recent study shows massive pre-trained LM still struggle with such temporal reasoning under complex contexts (e.g., dialog) because they only implicitly encode the relevant contexts and fail to explicitly uncover the underlying logical compositions for complex inference, thus may not be robust enough. In this work, we propose to augment LMs with the temporal logic induction ability, which frames the temporal reasoning by defining three modular components: temporal dependency inducer and temporal concept defuzzifier and logic validator. The former two components disentangle the explicit/implicit dependency between temporal concepts across context (before, after, ...) and the specific meaning of fuzzy temporal concepts, respectively, while the validator combines the intermediate reasoning clues for robust contextual reasoning about the temporal concepts. Extensive experimental results on TIMEDIAL, a challenging dataset for temporal reasoning over dialog, show that our method, Logic Induction Enhanced Contextualized TEmporal Reasoning (LECTER), can yield great improvements over the traditional language model for temporal reasoning.

Subject: AAAI.2023 - Speech and Natural Language Processing

#8 Zero-Shot Cross-Lingual Event Argument Extraction with Language-Oriented Prefix-Tuning [PDF²] [Copy] [Kimi¹] [REL]

Authors: Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao

Event argument extraction (EAE) aims to identify the arguments of a given event, and classify the roles that those arguments play. Due to high data demands of training EAE models, zero-shot cross-lingual EAE has attracted increasing attention, as it greatly reduces human annotation effort. Some prior works indicate that generation-based methods have achieved promising performance for monolingual EAE. However, when applying existing generation-based methods to zero-shot cross-lingual EAE, we find two critical challenges, including Language Discrepancy and Template Construction. In this paper, we propose a novel method termed as Language-oriented Prefix-tuning Network (LAPIN) to address the above challenges. Specifically, we devise a Language-oriented Prefix Generator module to handle the discrepancies between source and target languages. Moreover, we leverage a Language-agnostic Template Constructor module to design templates that can be adapted to any language. Extensive experiments demonstrate that our proposed method achieves the best performance, outperforming the previous state-of-the-art model by 4.8% and 2.3% of the average F1-score on two multilingual EAE datasets.

Subject: AAAI.2023 - Speech and Natural Language Processing

#9 RPA: Reasoning Path Augmentation in Iterative Retrieving for Multi-Hop QA [PDF¹] [Copy] [Kimi] [REL]

Authors: Ziyi Cao, Bingquan Liu, Shaobo Li

Multi-hop questions are associated with a series of justifications, and one needs to obtain the answers by following the reasoning path (RP) that orders the justifications adequately. So reasoning path retrieval becomes a critical preliminary stage for multi-hop Question Answering (QA). Within the RP, two fundamental challenges emerge for better performance: (i) what the order of the justifications in the RP should be, and (ii) what if the wrong justification has been in the path. In this paper, we propose Reasoning Path Augmentation (RPA), which uses reasoning path reordering and augmentation to handle the above two challenges, respectively. Reasoning path reordering restructures the reasoning by targeting the easier justification first but difficult one later, in which the difficulty is determined by the overlap between query and justifications since the higher overlap means more lexical relevance and easier searchable. Reasoning path augmentation automatically generates artificial RPs, in which the distracted justifications are inserted to aid the model recover from the wrong justification. We build RPA with a naive pre-trained model and evaluate RPA on the QASC and MultiRC datasets. The evaluation results demonstrate that RPA outperforms previously published reasoning path retrieval methods, showing the effectiveness of the proposed methods. Moreover, we present detailed experiments on how the orders of justifications and the percent of augmented paths affect the question- answering performance, revealing the importance of polishing RPs and the necessity of augmentation.

Subject: AAAI.2023 - Speech and Natural Language Processing

#10 Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning [PDF²] [Copy] [Kimi¹] [REL]

Authors: Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng

Audio-visual speech recognition (AVSR) has gained remarkable success for ameliorating the noise-robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on audio modality as it is much easier to recognize than video modality in clean conditions. As a result, the AVSR model underestimates the importance of visual stream in face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages the MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art in both clean and various noisy conditions. Furthermore, we demonstrate the better generality of MSRL system than other baselines when test set contains unseen noises.

Subject: AAAI.2023 - Speech and Natural Language Processing

#11 Converge to the Truth: Factual Error Correction via Iterative Constrained Editing [PDF¹] [Copy] [Kimi¹] [REL]

Authors: Jiangjie Chen, Rui Xu, Wenxuan Zeng, Changzhi Sun, Lei Li, Yanghua Xiao

Given a possibly false claim sentence, how can we automatically correct it with minimal editing? Existing methods either require a large number of pairs of false and corrected claims for supervised training or do not handle well errors spanning over multiple tokens within an utterance. In this paper, we propose VENCE, a novel method for factual error correction (FEC) with minimal edits. VENCE formulates the FEC problem as iterative sampling editing actions with respect to a target density function. We carefully design the target function with predicted truthfulness scores from an offline trained fact verification model. VENCE samples the most probable editing positions based on back-calculated gradients of the truthfulness score concerning input tokens and the editing actions using a distantly-supervised language model (T5). Experiments on a public dataset show that VENCE improves the well-adopted SARI metric by 5.3 (or a relative improvement of 11.8%) over the previous best distantly-supervised methods.

Subject: AAAI.2023 - Speech and Natural Language Processing

#12 Adversarial Word Dilution as Text Data Augmentation in Low-Resource Regime [PDF¹] [Copy] [Kimi] [REL]

Authors: Junfan Chen, Richong Zhang, Zheyan Luo, Chunming Hu, Yongyi Mao

Data augmentation is widely used in text classification, especially in the low-resource regime where a few examples for each class are available during training. Despite the success, generating data augmentations as hard positive examples that may increase their effectiveness is under-explored. This paper proposes an Adversarial Word Dilution (AWD) method that can generate hard positive examples as text data augmentations to train the low-resource text classification model efficiently. Our idea of augmenting the text data is to dilute the embedding of strong positive words by weighted mixing with unknown-word embedding, making the augmented inputs hard to be recognized as positive by the classification model. We adversarially learn the dilution weights through a constrained min-max optimization process with the guidance of the labels. Empirical studies on three benchmark datasets show that AWD can generate more effective data augmentations and outperform the state-of-the-art text data augmentation methods. The additional analysis demonstrates that the data augmentations generated by AWD are interpretable and can flexibly extend to new examples without further training.

Subject: AAAI.2023 - Speech and Natural Language Processing

#13 CP-Rec: Contextual Prompting for Conversational Recommender Systems [PDF¹] [Copy] [Kimi] [REL]

Authors: Keyu Chen, Shiliang Sun

The conversational recommender system (CRS) aims to provide high-quality recommendations through interactive dialogues. However, previous CRS models have no effective mechanisms for task planning and topic elaboration, and thus they hardly maintain coherence in multi-task recommendation dialogues. Inspired by recent advances in prompt-based learning, we propose a novel contextual prompting framework for dialogue management, which optimizes prompts based on context, topics, and user profiles. Specifically, we develop a topic controller to sequentially plan the subtasks, and a prompt search module to construct context-aware prompts. We further adopt external knowledge to enrich user profiles and make knowledge-aware recommendations. Incorporating these techniques, we propose a conversational recommender system with contextual prompting, namely CP-Rec. Experimental results demonstrate that it achieves state-of-the-art recommendation accuracy and generates more coherent and informative conversations.

Subject: AAAI.2023 - Speech and Natural Language Processing

#14 A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [PDF¹] [Copy] [Kimi¹] [REL]

Authors: Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe the mismatch between training and inference alignments in mel-spectrogram based autoregressive models, leading to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolves this issue. We introduce our MQTTS system whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to identify the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.

Subject: AAAI.2023 - Speech and Natural Language Processing

#15 Learning to Memorize Entailment and Discourse Relations for Persona-Consistent Dialogues [PDF¹] [Copy] [Kimi] [REL]

Authors: Ruijun Chen, Jin Wang, Liang-Chih Yu, Xuejie Zhang

Maintaining engagement and consistency is particularly important in dialogue systems. Existing works have improved the performance of dialogue systems by intentionally learning interlocutor personas with sophisticated network structures. One issue with this approach is that it requires more personal corpora with annotations. Additionally, these models typically perform the next utterance prediction to generate a response but neglect the discourse coherence in the entire conversation. To address these issues, this study proposes a method of learning to memorize entailment and discourse relations for persona-consistent dialogue tasks. Entailment text pairs in natural language inference dataset were applied to learn latent entailment relations as external memories by premise-to-hypothesis generation task. Furthermore, an internal memory with a similar architecture was applied to the discourse information in the dialogue. Placing orthogonality restrictions on these two memory spaces ensures that the latent entailment relations remain dialogue-independent. Both memories collaborate to obtain entailment and discourse representation for the generation, allowing a deeper understanding of both consistency and coherence. Experiments on two large public datasets, PersonaChat and DSTC7-AVSD, demonstrated the effectiveness of the proposed method. Both automatic and human evaluations indicate that the proposed model outperforms several strong baselines in terms of both persona consistency and response coherence. Our source code is availabled at https://github.com/Chenrj233/LMEDR.

Subject: AAAI.2023 - Speech and Natural Language Processing

#16 Preference-Controlled Multi-Objective Reinforcement Learning for Conditional Text Generation [PDF¹] [Copy] [Kimi] [REL]

Authors: Wenqing Chen, Jidong Tian, Caoyun Fan, Yitian Li, Hao He, Yaohui Jin

Conditional text generation is to generate text sequences conditioning on linguistic or non-linguistic data. The main line of existing work proposed deterministic models to improve the fidelity of the generated text but often ignored the diversity. Another line relied on conditional variational auto-encoders (CVAEs), which increased the diversity over their deterministic backbones. However, CVAEs regard diversity as an implicit objective and may not be optimal. In this paper, we raise two questions: i) Can diversity be further improved with an explicit objective? ii) Since fidelity and diversity are two conflicting objectives, how can we obtain different multi-objective optimal solutions according to user preferences? To answer question i), we propose a multi-objective reinforcement learning (MORL) method which explicitly takes CIDEr and Self-CIDEr scores as the fidelity-oriented and diversity-oriented rewards respectively. To answer question ii), we propose a preference-controlled MORL method, which can obtain infinite multi-objective optimal solutions by tuning the preference variable. We conduct extensive experiments on paraphrasing and image captioning tasks, which show that in the fidelity-diversity trade-off space, our model outperforms both deterministic and CVAE-based baselines.

Subject: AAAI.2023 - Speech and Natural Language Processing

#17 Learning towards Selective Data Augmentation for Dialogue Generation [PDF¹] [Copy] [Kimi] [REL]

Authors: Xiuying Chen, Mingzhe Li, Jiayi Zhang, Xiaoqiang Xia, Chen Wei, Jianwei Cui, Xin Gao, Xiangliang Zhang, Rui Yan

As it is cumbersome and expensive to acquire a huge amount of data for training neural dialog models, data augmentation is proposed to effectively utilize existing training samples. However, current data augmentation techniques on the dialog generation task mostly augment all cases in the training dataset without considering the intrinsic attributes between different cases. We argue that not all cases are beneficial for augmentation task, and the cases suitable for augmentation should obey the following two attributes: (1) low-quality (the dialog model cannot generate a high-quality response for the case), (2) representative (the case should represent the property of the whole dataset). Herein, we explore this idea by proposing a Selective Data Augmentation framework (SDA) for the response generation task. SDA employs a dual adversarial network to select the lowest quality and most representative data points for augmentation in one stage. Extensive experiments conducted on two publicly available datasets, i.e., DailyDialog and OpenSubtitles, show that our framework can improve the response generation performance with respect to various metrics

Subject: AAAI.2023 - Speech and Natural Language Processing

#18 Learn from Yesterday: A Semi-supervised Continual Learning Method for Supervision-Limited Text-to-SQL Task Streams [PDF¹] [Copy] [Kimi] [REL]

Authors: Yongrui Chen, Xinnan Guo, Tongtong Wu, Guilin Qi, Yang Li, Yang Dong

Conventional text-to-SQL studies are limited to a single task with a fixed-size training and test set. When confronted with a stream of tasks common in real-world applications, existing methods struggle with the problems of insufficient supervised data and high retraining costs. The former tends to cause overfitting on unseen databases for the new task, while the latter makes a full review of instances from past tasks impractical for the model, resulting in forgetting of learned SQL structures and database schemas. To address the problems, this paper proposes integrating semi-supervised learning (SSL) and continual learning (CL) in a stream of text-to-SQL tasks and offers two promising solutions in turn. The first solution Vanilla is to perform self-training, augmenting the supervised training data with predicted pseudo-labeled instances of the current task, while replacing the full volume retraining with episodic memory replay to balance the training efficiency with the performance of previous tasks. The improved solution SFNet takes advantage of the intrinsic connection between CL and SSL. It uses in-memory past information to help current SSL, while adding high-quality pseudo instances in memory to improve future replay. The experiments on two datasets shows that SFNet outperforms the widely-used SSL-only and CL-only baselines on multiple metrics.

Subject: AAAI.2023 - Speech and Natural Language Processing

#19 A Scope Sensitive and Result Attentive Model for Multi-Intent Spoken Language Understanding [PDF¹] [Copy] [Kimi] [REL]

Authors: Lizhi Cheng, Wenmian Yang, Weijia Jia

Multi-Intent Spoken Language Understanding (SLU), a novel and more complex scenario of SLU, is attracting increasing attention. Unlike traditional SLU, each intent in this scenario has its specific scope. Semantic information outside the scope even hinders the prediction, which tremendously increases the difficulty of intent detection. More seriously, guiding slot filling with these inaccurate intent labels suffers error propagation problems, resulting in unsatisfied overall performance. To solve these challenges, in this paper, we propose a novel Scope-Sensitive Result Attention Network (SSRAN) based on Transformer, which contains a Scope Recognizer (SR) and a Result Attention Network (RAN). SR assignments scope information to each token, reducing the distraction of out-of-scope tokens. RAN effectively utilizes the bidirectional interaction between SF and ID results, mitigating the error propagation problem. Experiments on two public datasets indicate that our model significantly improves SLU performance (5.4% and 2.1% on Overall accuracy) over the state-of-the-art baseline.

Subject: AAAI.2023 - Speech and Natural Language Processing

#20 Unsupervised Explanation Generation via Correct Instantiations [PDF¹] [Copy] [Kimi] [REL]

Authors: Sijie Cheng, Zhiyong Wu, Jiangjie Chen, Zhixing Li, Yang Liu, Lingpeng Kong

While large pre-trained language models (PLM) have shown their great skills at solving discriminative tasks, a significant gap remains when compared with humans for explanation-related tasks. Among them, explaining the reason why a statement is wrong (e.g., against commonsense) is incredibly challenging. The major difficulty is finding the conflict point, where the statement contradicts our real world. This paper proposes Neon, a two-phrase, unsupervised explanation generation framework. Neon first generates corrected instantiations of the statement (phase I), then uses them to prompt large PLMs to find the conflict point and complete the explanation (phase II). We conduct extensive experiments on two standard explanation benchmarks, i.e., ComVE and e-SNLI. According to both automatic and human evaluations, Neon outperforms baselines, even for those with human-annotated instantiations. In addition to explaining a negative prediction, we further demonstrate that Neon remains effective when generalizing to different scenarios. The resources of Neon are available at: https://github.com/Shark-NLP/Neon.

Subject: AAAI.2023 - Speech and Natural Language Processing

#21 Prompt-Augmented Linear Probing: Scaling beyond the Limit of Few-Shot In-Context Learners [PDF¹] [Copy] [Kimi] [REL]

Authors: Hyunsoo Cho, Hyuhng Joon Kim, Junyeob Kim, Sang-Woo Lee, Sang-goo Lee, Kang Min Yoo, Taeuk Kim

Through in-context learning (ICL), large-scale language models are effective few-shot learners without additional model fine-tuning. However, the ICL performance does not scale well with the number of available training sample as it is limited by the inherent input length constraint of the underlying language model. Meanwhile, many studies have revealed that language models are also powerful feature extractors, allowing them to be utilized in a black-box manner and enabling the linear probing paradigm, where lightweight discriminators are trained on top of the pre-extracted input representations. This paper proposes prompt-augmented linear probing (PALP), a hybrid of linear probing and ICL, which leverages the best of both worlds. PALP inherits the scalability of linear probing and the capability of enforcing language models to derive more meaningful representations via tailoring input into a more conceivable form. Throughout in-depth investigations on various datasets, we verified that PALP significantly closes the gap between ICL in the data-hungry scenario and fine-tuning in the data-abundant scenario with little training overhead, potentially making PALP a strong alternative in a black-box scenario.

Subject: AAAI.2023 - Speech and Natural Language Processing

#22 Neural Dynamic Focused Topic Model [PDF²] [Copy] [Kimi] [REL]

Authors: Kostadin Cvejoski, Ramsés J. Sánchez, César Ojeda

Topic models and all their variants analyse text by learning meaningful representations through word co-occurrences. As pointed out by previous work, such models implicitly assume that the probability of a topic to be active and its proportion within each document are positively correlated. This correlation can be strongly detrimental in the case of documents created over time, simply because recent documents are likely better described by new and hence rare topics. In this work we leverage recent advances in neural variational inference and present an alternative neural approach to the dynamic Focused Topic Model. Indeed, we develop a neural model for topic evolution which exploits sequences of Bernoulli random variables in order to track the appearances of topics, thereby decoupling their activities from their proportions. We evaluate our model on three different datasets (the UN general debates, the collection of NeurIPS papers, and the ACL Anthology dataset) and show that it (i) outperforms state-of-the-art topic models in generalization tasks and (ii) performs comparably to them on prediction tasks, while employing roughly the same number of parameters, and converging about two times faster.

Subject: AAAI.2023 - Speech and Natural Language Processing

#23 Improving Simultaneous Machine Translation with Monolingual Data [PDF¹] [Copy] [Kimi] [REL]

Authors: Hexuan Deng, Liang Ding, Xuebo Liu, Meishan Zhang, Dacheng Tao, Min Zhang

Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model. However, there is still a significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh and En-Ja news domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired by the behavior of human simultaneous interpreters, we propose a novel monolingual sampling strategy for SiMT, considering both chunk length and monotonicity. Experimental results show that our sampling strategy consistently outperforms the random sampling strategy (and other conventional typical NMT monolingual sampling strategies) by avoiding the key problem of SiMT -- hallucination, and has better scalability. We achieve +0.72 BLEU improvements on average against random sampling on En-Zh and En-Ja. Data and codes can be found at https://github.com/hexuandeng/Mono4SiMT.

Subject: AAAI.2023 - Speech and Natural Language Processing

#24 Domain-Adapted Dependency Parsing for Cross-Domain Named Entity Recognition [PDF²] [Copy] [Kimi] [REL]

Authors: Chenxiao Dou, Xianghui Sun, Yaoshu Wang, Yunjie Ji, Baochang Ma, Xiangang Li

In recent years, many researchers have leveraged structural information from dependency trees to improve Named Entity Recognition (NER). Most of their methods take dependency-tree labels as input features for NER model training. However, such dependency information is not inherently provided in most NER corpora, making the methods with low usability in practice. To effectively exploit the potential of word-dependency knowledge, motivated by the success of Multi-Task Learning on cross-domain NER, we investigate a novel NER learning method incorporating cross-domain Dependency Parsing (DP) as its auxiliary learning task. Then, considering the high consistency of word-dependency relations across domains, we present an unsupervised domain-adapted method to transfer word-dependency knowledge from high-resource domains to low-resource ones. With the help of cross-domain DP to bridge different domains, both useful cross-domain and cross-task knowledge can be learned by our model to considerably benefit cross-domain NER. To make better use of the cross-task knowledge between NER and DP, we unify both tasks in a shared network architecture for joint learning, using Maximum Mean Discrepancy(MMD). Finally, through extensive experiments, we show our proposed method can not only effectively take advantage of word-dependency knowledge, but also significantly outperform other Multi-Task Learning methods on cross-domain NER. Our code is open-source and available at https://github.com/xianghuisun/DADP.

Subject: AAAI.2023 - Speech and Natural Language Processing

#25 MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing [PDF¹] [Copy] [Kimi] [REL]

Authors: Longxu Dou, Yan Gao, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Dechen Zhan, Jian-Guang Lou

Text-to-SQL semantic parsing is an important NLP task, which facilitates the interaction between users and the database. Much recent progress in text-to-SQL has been driven by large-scale datasets, but most of them are centered on English. In this work, we present MultiSpider, the largest multilingual text-to-SQL semantic parsing dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese). Upon MultiSpider we further identify the lexical and structural challenges of text-to-SQL (caused by specific language properties and dialect sayings) and their intensity across different languages. Experimental results under various settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages. Qualitative and quantitative analyses are conducted to understand the reason for the performance drop of each language. Besides the dataset, we also propose a simple schema augmentation framework SAVe (Schema-Augmentation-with-Verification), which significantly boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.

Subject: AAAI.2023 - Speech and Natural Language Processing